Wrapper Induction: Learning (k,l)-Contextual Tree Languages Directly as Unranked Tree Automata
Authors
Abstract
A (k, l)-contextual tree language can be learned from positive examples only; such languages have been successfully used as wrappers for information extraction from web pages. This paper shows how to represent the wrapper as an unranked tree automaton and how to construct it directly from the examples instead of using the (k, l)-forks of the examples. The former speeds up extraction; the latter speeds up learning.
Similar resources
Parameterless Information Extraction Using (k,l)-Contextual Tree Languages
Recently, several wrapper induction algorithms for structured documents have been introduced. They are based on contextual tree languages and learn from positive examples only but have the disadvantage that they need parameters. To obtain the optimal parameter setting, they use precision and recall. This goes in fact beyond learning from positive examples only. In this paper, a parameter estima...
Learning (k, l)-Contextual Tree Languages for Information Extraction
Learning regular languages from positive examples only is known to be infeasible. A common solution is to define a learnable subclass of the regular languages. In the past, this has been done for regular string languages. Using ideas from those techniques, we define a learnable subclass of regular unranked tree languages, called the (k,l)-contextual tree languages. We describe the use of this s...
Technical Report No. 2010-567 State Complexity of Unranked Tree Automata
We consider the representational state complexity of unranked tree automata. The bottom-up computation of an unranked tree automaton may be either deterministic or nondeterministic, and further variants arise depending on whether the horizontal string languages defining the transitions are represented by a DFA or an NFA. Also, we consider for unranked tree automata the alternative syntactic defi...
Multidimensional fuzzy finite tree automata
This paper introduces the notion of multidimensional fuzzy finite tree automata (MFFTA) and investigates its closure properties from the area of automata and language theory. MFFTA are a superclass of fuzzy tree automata whose behavior is generalized to adapt to multidimensional fuzzy sets. An MFFTA recognizes a multidimensional fuzzy tree language which is a regular tree language so that for e...
Transformations Between Different Models of Unranked Bottom-Up Tree Automata
We consider the representational state complexity of unranked tree automata. The bottom-up computation of an unranked tree automaton may be either deterministic or nondeterministic, and further variants arise depending on whether the horizontal string languages defining the transitions are represented by a DFA or an NFA. Also, we consider for unranked tree automata the alternative syntactic def...